5.2 MLE

1 MLE

For a generic dominated family $\mathcal{P}=\{P_\theta : \theta\in\Theta\}$ with densities $p_\theta$, a simple estimator for $\theta$ is the maximum likelihood estimator (MLE): $\hat\theta_{\mathrm{MLE}}(X)=\operatorname{argmax}_{\theta\in\Theta} p_\theta(X)=\operatorname{argmax}_{\theta\in\Theta} \ell(\theta;X)$, where $\ell(\theta;X)=\log p_\theta(X)$ is the log-likelihood.
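As a minimal numerical sketch (my example, not from the text): for $X_1,\dots,X_n\overset{\text{i.i.d.}}{\sim}N(\theta,1)$ the log-likelihood is maximized at the sample mean, so a crude grid argmax should agree with the closed form.

```python
import numpy as np

# MLE for N(theta, 1) by direct maximization of the log-likelihood;
# the known closed-form answer for this model is the sample mean.
rng = np.random.default_rng(0)
theta0 = 1.5
x = rng.normal(theta0, 1.0, size=500)

def log_lik(theta):
    # l(theta; X) = sum_i log p_theta(X_i), dropping the additive constant
    return -0.5 * np.sum((x - theta) ** 2)

grid = np.linspace(-5.0, 5.0, 20001)          # crude argmax over a fine grid
theta_mle = grid[np.argmax([log_lik(t) for t in grid])]
closed_form = x.mean()                        # closed-form MLE for this model
```

The grid argmax matches the closed form up to the grid spacing; in realistic models the argmax is computed by numerical optimization rather than a grid.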

The above example shows that the MLE can have embarrassing finite-sample performance despite being asymptotically optimal.

Proposition

If $P(B_n)\to 0$, $X_n\xrightarrow{d} X$, and $Z_n$ is arbitrary, then $X_n\mathbf{1}_{B_n^c}+Z_n\mathbf{1}_{B_n}\xrightarrow{d} X$.
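A simulation sketch of the proposition (my construction, not from the text): $X_n$ is a standardized sample mean (so $X_n\xrightarrow{d}N(0,1)$ by the CLT), $B_n$ is a "bad" event with $P(B_n)=1/n$, and $Z_n=10^6$ is arbitrary garbage on $B_n$; the patched variable has nearly the same distribution.

```python
import numpy as np

# Patching X_n with garbage on a vanishing-probability event B_n
# does not change its limiting distribution.
rng = np.random.default_rng(1)
n, reps = 1000, 5000
u = rng.uniform(size=(reps, n))
xn = (u.mean(axis=1) - 0.5) * np.sqrt(12.0 * n)   # approx N(0, 1) by CLT
bn = rng.uniform(size=reps) < 1.0 / n             # indicator of B_n, P(B_n) = 1/n
yn = np.where(bn, 1e6, xn)                        # X_n 1_{B_n^c} + Z_n 1_{B_n}
frac_x = float((xn <= 0).mean())                  # empirical CDFs at 0...
frac_y = float((yn <= 0).mean())                  # ...nearly agree
```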

2 Asymptotic Efficiency

The nice behavior of MLE we found in the exponential family case generalizes to a much broader class of models.

Setting: $X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_\theta(x)$, $\theta\in\Theta\subseteq\mathbb{R}^d$, with $p_\theta$ "smooth" in $\theta$.
Let $\ell_1(\theta;X_i)=\log p_\theta(X_i)$ and $\ell_n(\theta;X)=\sum_{i=1}^n \ell_1(\theta;X_i)$. Then $$J_1(\theta)=\operatorname{Var}_\theta\big(\nabla \ell_1(\theta;X_i)\big)=-E_\theta\big[\nabla^2 \ell_1(\theta;X_i)\big],\qquad J_n(\theta)=\operatorname{Var}_\theta\big(\nabla \ell_n(\theta;X)\big)=nJ_1(\theta).$$
We say an estimator $\hat\theta_n$ is asymptotically efficient if $\sqrt{n}(\hat\theta_n-\theta)\xrightarrow{d} N\big(0,J_1(\theta)^{-1}\big)$ under $P_\theta$.
Delta method for a differentiable estimand $g(\theta)$: $\sqrt{n}\big(g(\hat\theta_n)-g(\theta)\big)\xrightarrow{d} N\big(0,\nabla g(\theta)^T J_1(\theta)^{-1}\nabla g(\theta)\big)$ under $P_\theta$, so $g(\hat\theta_n)$ also achieves the CRLB for $g(\theta)$ if $\hat\theta_n$ does.
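A sketch of the delta method (my example, not from the text): for $X_i\sim\text{Exp}(\text{rate }\theta_0)$ the MLE of the rate is $1/\bar X$, and $g(\theta)=1/\theta$ maps it back to the mean. Here $J_1(\theta)=1/\theta^2$ and $g'(\theta)=-1/\theta^2$, so the delta-method asymptotic variance is $g'(\theta)^2 J_1(\theta)^{-1}=(1/\theta^4)\cdot\theta^2=1/\theta^2$.

```python
import numpy as np

# Monte Carlo check that sqrt(n)(g(theta_hat) - g(theta0)) has sd 1/theta0.
rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 400, 4000
x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
g_hat = x.mean(axis=1)                      # g(theta_hat) = 1/(1/xbar) = xbar
delta_sd = 1.0 / theta0                     # delta-method asymptotic sd
emp_sd = float(np.sqrt(n) * g_hat.std())    # Monte Carlo sd of sqrt(n)(g_hat - g(theta0))
```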

3 Asymptotic Distribution of MLE

Under mild conditions, $\hat\theta_{\mathrm{MLE}}$ is asymptotically Gaussian and efficient. We will be interested in $\ell_n(\theta;X)$ as a function of $\theta$. Notate the "true" value as $\theta_0$ (i.e. $X\sim P_{\theta_0}$).
Then for $\theta_0\in\Theta$, $\nabla\ell_1(\theta_0;X_i)\overset{\text{i.i.d.}}{\sim}(0,J_1(\theta_0))$, so $$\frac{1}{\sqrt n}\nabla\ell_n(\theta_0;X)=\sqrt n\cdot\frac1n\sum_{i=1}^n \nabla\ell_1(\theta_0;X_i)\xrightarrow{d} N\big(0,J_1(\theta_0)\big),\qquad \frac1n\nabla^2\ell_n(\theta_0;X)\xrightarrow{P_{\theta_0}} E_{\theta_0}\nabla^2\ell_1(\theta_0;X_i)=-J_1(\theta_0).$$
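Both building blocks can be checked numerically (my example, not from the text): for Bernoulli$(\theta_0)$, $\ell_1=x\log\theta+(1-x)\log(1-\theta)$, so the score at $\theta_0$ is $x/\theta_0-(1-x)/(1-\theta_0)$ and $J_1(\theta_0)=1/(\theta_0(1-\theta_0))$.

```python
import numpy as np

# The score at theta0 has mean 0 and variance J_1(theta0); the averaged
# negative Hessian converges to the same J_1 (the information identity).
rng = np.random.default_rng(3)
theta0, n = 0.3, 200000
x = rng.binomial(1, theta0, size=n).astype(float)
score = x / theta0 - (1.0 - x) / (1.0 - theta0)          # d/dtheta l1 at theta0
hess = -x / theta0**2 - (1.0 - x) / (1.0 - theta0)**2    # d^2/dtheta^2 l1 at theta0
J1 = 1.0 / (theta0 * (1.0 - theta0))
mean_score = float(score.mean())
var_score = float(score.var())
neg_mean_hess = float(-hess.mean())
```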

4 Consistency of MLE

$X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_{\theta_0}$, $\hat\theta_n\in\operatorname{argmax}_{\theta\in\Theta}\ell_n(\theta;X)$.[1] The question is: when does $\hat\theta_n\xrightarrow{p}\theta_0$?

Recall the KL divergence: $D_{\mathrm{KL}}(\theta_0\|\theta)=E_{\theta_0}\log\frac{p_{\theta_0}(X_i)}{p_\theta(X_i)}\ge 0$. Let $W_i(\theta)=\ell_1(\theta;X_i)-\ell_1(\theta_0;X_i)$ and $\overline W_n=\frac1n\sum_{i=1}^n W_i$. Note $\hat\theta_n\in\operatorname{argmax}_{\theta\in\Theta}\overline W_n(\theta)$ too. $$\overline W_n(\theta)\xrightarrow{p} E_{\theta_0}W_i(\theta)=-D_{\mathrm{KL}}(\theta_0\|\theta)\le 0,\quad\text{with equality iff }\theta=\theta_0. \tag{4.1}$$ But this pointwise convergence is not enough:
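Before turning to uniform convergence, a quick numerical check of (4.1) (my example, not from the text): for $N(\theta,1)$, $D_{\mathrm{KL}}(\theta_0\|\theta)=(\theta-\theta_0)^2/2$, and $\overline W_n(\theta)$ should be close to $-D_{\mathrm{KL}}$, i.e. $\le 0$ with its maximum ($=0$) exactly at $\theta=\theta_0$.

```python
import numpy as np

# W_bar_n(theta) for the unit-variance normal location model.
rng = np.random.default_rng(4)
theta0, n = 1.0, 50000
x = rng.normal(theta0, 1.0, size=n)

def w_bar(theta):
    # mean of W_i(theta) = l1(theta;X_i) - l1(theta0;X_i)
    return float(np.mean(-0.5 * (x - theta) ** 2 + 0.5 * (x - theta0) ** 2))

def kl(theta):
    return 0.5 * (theta - theta0) ** 2   # closed-form KL for unit-variance normals

w_at_2 = w_bar(2.0)                      # should be close to -kl(2.0) = -0.5
```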

Convergence of function sequences

For compact $K$, let $C(K)=\{f:K\to\mathbb{R}\ \text{continuous}\}$. For $f\in C(K)$, let $\|f\|_\infty=\sup_{t\in K}|f(t)|$. We say $f_n\to f$ (resp. $f_n\xrightarrow{p} f$) in this norm if $\|f_n-f\|_\infty\to 0$ (resp. $\|f_n-f\|_\infty\xrightarrow{p} 0$).

Theorem (Uniform LLN)

Assume $K$ is compact and $W_1,W_2,\dots\in C(K)$ are i.i.d. with $E\|W_i\|_\infty<\infty$; let $\mu(t)=EW_i(t)$. Then $\mu\in C(K)$, and for every $\varepsilon>0$, $$P\Big(\Big\|\frac1n\sum_{i=1}^n W_i-\mu\Big\|_\infty>\varepsilon\Big)\to 0,$$
i.e. $\|\overline W_n-\mu\|_\infty\xrightarrow{p} 0$.
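A sketch of the uniform LLN in the same normal model (my example, not from the text): on the compact set $K=[-2,4]$, the sup-norm distance between $\overline W_n$ and $\mu(\theta)=-D_{\mathrm{KL}}(\theta_0\|\theta)$ is small for large $n$ (here it equals $|\bar X_n-\theta_0|\cdot\sup_{\theta\in K}|\theta-\theta_0|$, approximated on a grid).

```python
import numpy as np

# Sup-norm convergence of W_bar_n to mu over a compact set K.
rng = np.random.default_rng(5)
theta0, n = 1.0, 10000
x = rng.normal(theta0, 1.0, size=n)
grid = np.linspace(-2.0, 4.0, 601)           # compact K = [-2, 4]

w_bar = np.array([np.mean(-0.5 * (x - t) ** 2 + 0.5 * (x - theta0) ** 2)
                  for t in grid])
mu = -0.5 * (grid - theta0) ** 2             # pointwise limit -D_KL on K
sup_dev = float(np.max(np.abs(w_bar - mu)))  # ||W_bar_n - mu||_infty over the grid
```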

Theorem (Consistency of MLE for Compact Θ)

$X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_{\theta_0}$, where $\mathcal{P}$ has densities $p_\theta$, $\theta\in\Theta$. Assume:

  • $\log p_\theta(x)$ is continuous in $\theta$ for every $x\in\mathcal{X}$.
  • $\Theta$ is compact.
  • $E_{\theta_0}\big[\sup_{\theta\in\Theta}|W_i(\theta)|\big]<\infty$.
  • The model is identifiable ($P_\theta=P_{\theta_0}\Rightarrow\theta=\theta_0$).

Then $\hat\theta_n\xrightarrow{p}\theta_0$ for any $\hat\theta_n\in\operatorname{argmax}_{\theta\in\Theta}\ell_n(\theta;X)$.
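A sketch of the theorem at work (my example, not from the text): $N(\theta,1)$ with the compact parameter space $\Theta=[-3,3]$. The argmax of $\ell_n$ over $\Theta$ is the sample mean clipped to $\Theta$, and the error shrinks as $n$ grows.

```python
import numpy as np

# Consistency of the constrained MLE in a compact parameter space.
rng = np.random.default_rng(6)
theta0 = 1.0
errs = {}
for n in (100, 10000):
    x = rng.normal(theta0, 1.0, size=n)
    # argmax of l_n over Theta = [-3, 3] is the clipped sample mean
    theta_hat = float(np.clip(x.mean(), -3.0, 3.0))
    errs[n] = abs(theta_hat - theta0)
```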

We usually care about non-compact parameter spaces, so we need some extra assumption to get us there.

Corollary

Same assumptions, except now $\Theta=\mathbb{R}^d$ (non-compact); if there is some $R<\infty$ large enough that $P_{\theta_0}(\|\hat\theta_n-\theta_0\|>R)\to 0$, then $\hat\theta_n\xrightarrow{p}\theta_0$.

So the only thing we actually need to worry about is whether $\hat\theta_n$ is extremely far from $\theta_0$ with non-negligible probability.

Theorem (Asymptotic Distribution of MLE)

$X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} p_{\theta_0}$, where $\mathcal{P}$ has densities $p_\theta$, $\theta\in\Theta$.
Assume:

  • $\mathcal{P}$ is identifiable.
  • $\Theta$ is compact, with $\theta_0$ an interior point.
  • $E_{\theta_0}\big[\sup_{\theta\in\Theta}|W_i(\theta)|\big]<\infty$.
  • $\ell_1(\theta;X_i)=\log p_\theta(X_i)$ has two continuous derivatives in $\theta$.
  • $E_{\theta_0}\big[\sup_{\theta\in\Theta}\|\nabla^2\ell_1(\theta;X_i)\|\big]<\infty$.
  • $J_1(\theta_0)=-E_{\theta_0}\nabla^2\ell_1(\theta_0;X_i)$ is positive definite.

Then $\sqrt{n}(\hat\theta_n-\theta_0)\xrightarrow{d} N\big(0,J_1(\theta_0)^{-1}\big)$.
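A Monte Carlo sketch of the theorem (my example, not from the text): for Bernoulli$(\theta_0)$ the MLE is the sample mean and $J_1(\theta_0)^{-1}=\theta_0(1-\theta_0)$, so the empirical variance of $\sqrt n(\hat\theta_n-\theta_0)$ should match $J_1^{-1}$.

```python
import numpy as np

# Asymptotic normality of the Bernoulli MLE: sqrt(n)(theta_hat - theta0)
# should be approximately N(0, theta0(1 - theta0)).
rng = np.random.default_rng(7)
theta0, n, reps = 0.3, 500, 5000
theta_hat = rng.binomial(n, theta0, size=reps) / n   # MLE = sample mean
z = np.sqrt(n) * (theta_hat - theta0)
inv_J1 = theta0 * (1.0 - theta0)                     # J_1(theta0)^{-1}
emp_var = float(z.var())
```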


  1. This will be OK if $\hat\theta_n$ comes close to maximizing $\ell_n$. ↩︎